Seeping Semantics: Linking Datasets using Word Embeddings for Data Discovery
Abstract
Employees who spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes a lack of knowledge of the schemas, aggravates this problem. Similar to how we navigate the Web today, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other, to facilitate navigating the schemas. They also relate data to external data sources such as ontologies and dictionaries, to help explain the schema meaning. We materialize the links in an enterprise knowledge graph, where they become available to analysts. The main challenge is how to find pairs of objects that are semantically related. We propose SEMPROP, a DAG of different components that find links based on syntactic and semantic similarities. SEMPROP is commanded by a semantic matcher that leverages word embeddings to find objects that are semantically related. To leverage word embeddings, we introduce coherent groups, a novel technique for combining them that works better than other state-of-the-art alternatives for this problem. We implement SEMPROP as part of a discovery system we are building and conduct user studies, real deployments, and a quantitative evaluation to understand the benefits of links for data discovery tasks, as well as the benefits of SEMPROP and coherent groups for finding those links.
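The abstract does not spell out how coherent groups combine word embeddings, so the following is only a minimal sketch of one plausible reading: two names (for example, a column name and an ontology class name) are each split into a group of words, and the groups are treated as related when the average pairwise cosine similarity of their word vectors clears a threshold. The toy vectors, the function names, and the 0.5 threshold are all assumptions made for illustration; none of them come from the paper.

```python
import math
from itertools import product

# Toy 3-dimensional vectors, invented for illustration only.
# A real system would load pre-trained word embeddings instead.
TOY_EMBEDDINGS = {
    "drug": (0.9, 0.1, 0.0),
    "compound": (0.8, 0.2, 0.1),
    "molecule": (0.85, 0.15, 0.05),
    "invoice": (0.0, 0.2, 0.9),
    "payment": (0.1, 0.1, 0.95),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_similarity(words_a, words_b, embeddings):
    """Average pairwise cosine similarity between two groups of words.

    Words missing from the vocabulary are skipped. Returns None when
    either group has no known word, so the caller can abstain.
    """
    vecs_a = [embeddings[w] for w in words_a if w in embeddings]
    vecs_b = [embeddings[w] for w in words_b if w in embeddings]
    if not vecs_a or not vecs_b:
        return None
    sims = [cosine(u, v) for u, v in product(vecs_a, vecs_b)]
    return sum(sims) / len(sims)


def semantically_related(words_a, words_b, embeddings, threshold=0.5):
    """Hypothetical decision rule: related if the group score clears the threshold."""
    score = group_similarity(words_a, words_b, embeddings)
    return score is not None and score >= threshold


related = semantically_related(["drug", "compound"], ["molecule"], TOY_EMBEDDINGS)
unrelated = semantically_related(["drug"], ["invoice", "payment"], TOY_EMBEDDINGS)
unknown = group_similarity(["zzz"], ["drug"], TOY_EMBEDDINGS)
```

Averaging over all word pairs, rather than comparing a single summed vector per name, is a design choice of this sketch: it keeps one unrelated word in a multi-word name from being hidden by the others.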
Similar resources
Learning Word Embeddings from Tagging Data: A methodological comparison
The semantics hidden in natural language are an essential building block for a common language understanding needed in areas like NLP or the Semantic Web. Such information is hidden for example in lightweight knowledge representations such as tagging systems and folksonomies. While extracting relatedness from tagging systems shows promising results, the extracted information is often encoded in...
Estimating the Parameters for Linking Unstandardized References with the Matrix Comparator
This paper discusses recent research on methods for estimating configuration parameters for the Matrix Comparator used for linking unstandardized or heterogeneously standardized references. The matrix comparator computes the aggregate similarity between the tokens (words) in a pair of references. The two most critical parameters for the matrix comparator for obtaining the best linking results a...
Turkish entity discovery with word embeddings
Entity-linking systems link noun phrase mentions in a text to their corresponding knowledge base entities in order to enrich a text with metadata. Wikipedia is a popular and comprehensive knowledge base that is widely used in entity-linking systems. However, long-tail entities are not popular enough to have their own Wikipedia articles. Therefore, a knowledge base created by using Wikipedia ent...
Radical-Based Hierarchical Embeddings for Chinese Sentiment Analysis at Sentence Level
Text representation in Chinese sentiment analysis usually works at the word or character level. In this paper, we prove that radical-level processing could greatly improve sentiment classification performance. In particular, we propose two types of Chinese radical-based hierarchical embeddings. The embeddings incorporate not only semantics at radical and character level, but also sentiment inf...
Leveraging Distributional Semantics for Multi-Label Learning
We present a novel and scalable label embedding framework for large-scale multi-label learning a.k.a ExMLDS (Extreme Multi-Label Learning using Distributional Semantics). Our approach draws inspiration from ideas rooted in distributional semantics, specifically the Skip Gram Negative Sampling (SGNS) approach, widely used to learn word embeddings for natural language processing tasks. Learning s...